Discovering terms using statistical corpus analysis

ABSTRACT

Software that extracts contextually relevant terms from a text sample (or corpus) by performing the following steps: (i) identifying a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; (ii) adding the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and (iii) identifying a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).

BACKGROUND OF THE INVENTION

The present invention relates generally to the field of natural language processing, and more particularly to “term extraction.”

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding (that is, enabling computers to derive meaning from human or natural language input).

Information Extraction (IE) is a known element of NLP. IE is the task of automatically extracting structured information from unstructured (and/or semi-structured) machine-readable documents. Term Extraction is a sub-task of IE. The goal of Term Extraction is to automatically extract relevant terms from a given text (or “corpus”). Term Extraction is used in many NLP tasks and applications, such as question answering, information retrieval, ontology engineering, semantic web, text summarization, document classification, and clustering. Generally, in term extraction, statistical and machine learning methods may be used to help select relevant terms.

Domain ontologies are known. A domain ontology represents concepts which belong to a particular “domain” such as an industry or a genre. In fact, multiple domain ontologies may exist within a single domain due to differences in language, intended use of the ontologies, and different perceptions of the domain. However, since domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible. In the context of NLP, term extraction becomes difficult when the text being processed belongs to a different domain (for example, medical technology) than the domain from which the NLP software was built (for example, financial news).

SUMMARY

According to an aspect of the present invention, there is a method, computer program product and/or system that performs the following steps (not necessarily in the following order): (i) identifying a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; (ii) adding the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and (iii) identifying a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram view of a first embodiment of a system according to the present invention;

FIG. 2 is a flowchart showing a first embodiment method performed, at least in part, by the first embodiment system;

FIG. 3 is a block diagram view of a machine logic (for example, software) portion of the first embodiment system;

FIG. 4 is a flowchart view of a method according to the present invention;

FIG. 5 is a flowchart view of a method according to the present invention;

FIG. 6 is a flowchart view of a method according to the present invention;

FIG. 7 is a flowchart view of a method according to the present invention;

FIG. 8 is a flowchart view of a method according to the present invention;

FIG. 9 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;

FIG. 10 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;

FIG. 11 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;

FIG. 12 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;

FIG. 13 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;

FIG. 14 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;

FIG. 15 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention; and

FIG. 16 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.

DETAILED DESCRIPTION

Some embodiments of the present invention extract contextually relevant terms from a text sample (or corpus) by iteratively discovering new terms using weighted “contextual characteristics” of terms discovered in previous iterations. Roughly speaking, a “contextual characteristic” is a feature of a term derived from that term's particular usage in a given corpus (for example, one contextual characteristic is a list of words that commonly precede or follow a given term in the corpus). This Detailed Description section is divided into the following sub-sections: (i) The Hardware and Software Environment; (ii) Example Embodiment; (iii) Further Comments and/or Embodiments; and (iv) Definitions.

I. The Hardware and Software Environment

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

An embodiment of a possible hardware and software environment for software and/or methods according to the present invention will now be described in detail with reference to the Figures. FIG. 1 is a functional block diagram illustrating various portions of networked computers system 100, including: sub-system 102; client sub-systems 104, 106, 108, 110, 112; communication network 114; computer 200; communication unit 202; processor set 204; input/output (I/O) interface set 206; memory device 208; persistent storage device 210; display device 212; external device set 214; random access memory (RAM) devices 230; cache memory device 232; and program 300.

Sub-system 102 is, in many respects, representative of the various computer sub-system(s) in the present invention. Accordingly, several portions of sub-system 102 will now be discussed in the following paragraphs.

Sub-system 102 may be a laptop computer, tablet computer, netbook computer, personal computer (PC), a desktop computer, a personal digital assistant (PDA), a smart phone, or any programmable electronic device capable of communicating with the client sub-systems via network 114. Program 300 is a collection of machine readable instructions and/or data that is used to create, manage and control certain software functions that will be discussed in detail, below, in the Example Embodiment sub-section of this Detailed Description section.

Sub-system 102 is capable of communicating with other computer sub-systems via network 114. Network 114 can be, for example, a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination of the two, and can include wired, wireless, or fiber optic connections. In general, network 114 can be any combination of connections and protocols that will support communications between server and client sub-systems.

Sub-system 102 is shown as a block diagram with many double arrows. These double arrows (no separate reference numerals) represent a communications fabric, which provides communications between various components of sub-system 102. This communications fabric can be implemented with any architecture designed for passing data and/or control information between processors (such as microprocessors, communications and network processors, etc.), system memory, peripheral devices, and any other hardware components within a system. For example, the communications fabric can be implemented, at least in part, with one or more buses.

Memory 208 and persistent storage 210 are computer-readable storage media. In general, memory 208 can include any suitable volatile or non-volatile computer-readable storage media. It is further noted that, now and/or in the near future: (i) external device(s) 214 may be able to supply, some or all, memory for sub-system 102; and/or (ii) devices external to sub-system 102 may be able to provide memory for sub-system 102.

Program 300 is stored in persistent storage 210 for access and/or execution by one or more of the respective computer processors 204, usually through one or more memories of memory 208. Persistent storage 210: (i) is at least more persistent than a signal in transit; (ii) stores the program (including its soft logic and/or data), on a tangible medium (such as magnetic or optical domains); and (iii) is substantially less persistent than permanent storage. Alternatively, data storage may be more persistent and/or permanent than the type of storage provided by persistent storage 210.

Program 300 may include both machine readable and performable instructions and/or substantive data (that is, the type of data stored in a database). In this particular embodiment, persistent storage 210 includes a magnetic hard disk drive. To name some possible variations, persistent storage 210 may include a solid state hard drive, a semiconductor storage device, read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, or any other computer-readable storage media that is capable of storing program instructions or digital information.

The media used by persistent storage 210 may also be removable. For example, a removable hard drive may be used for persistent storage 210. Other examples include optical and magnetic disks, thumb drives, and smart cards that are inserted into a drive for transfer onto another computer-readable storage medium that is also part of persistent storage 210.

Communications unit 202, in these examples, provides for communications with other data processing systems or devices external to sub-system 102. In these examples, communications unit 202 includes one or more network interface cards. Communications unit 202 may provide communications through the use of either or both physical and wireless communications links. Any software modules discussed herein may be downloaded to a persistent storage device (such as persistent storage device 210) through a communications unit (such as communications unit 202).

I/O interface set 206 allows for input and output of data with other devices that may be connected locally in data communication with server computer 200. For example, I/O interface set 206 provides a connection to external device set 214. External device set 214 will typically include devices such as a keyboard, keypad, a touch screen, and/or some other suitable input device. External device set 214 can also include portable computer-readable storage media such as, for example, thumb drives, portable optical or magnetic disks, and memory cards. Software and data used to practice embodiments of the present invention, for example, program 300, can be stored on such portable computer-readable storage media. In these embodiments the relevant software may (or may not) be loaded, in whole or in part, onto persistent storage device 210 via I/O interface set 206. I/O interface set 206 also connects in data communication with display device 212.

Display device 212 provides a mechanism to display data to a user and may be, for example, a computer monitor or a smart phone display screen.

The programs described herein are identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature herein is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The terminology used herein was chosen to best explain the principles of the embodiment, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

II. Example Embodiment

FIG. 2 shows flowchart 250 depicting a method according to the present invention. FIG. 3 shows program 300 for performing at least some of the method steps of flowchart 250. This method and associated software will now be discussed, over the course of the following paragraphs, with extensive reference to FIG. 2 (for the method step blocks) and FIG. 3 (for the software blocks).

The present embodiment refers extensively to a high precision domain lexicon (HPDL). The HPDL (also referred to as a “set of category related terms”) is a collection of terms (words or sets of words) that belong to a specific domain, category, or genre (“domain”). In term extraction, and more generally in natural language processing, the HPDL can serve as an underlying “knowledge base” for a given domain so as to extract more contextually relevant terms from a piece of text (or corpus). In many embodiments of the present invention, the HPDL is used to: (i) extract contextually relevant terms (term extraction); and (ii) extract additional HPDL-eligible terms in order to grow, strengthen, and/or expand the HPDL.

HPDL domains may have multiple categories (or sub-domains). For example, the domain of smartphones may include categories such as smartphone models, smartphone apps, and/or smartphone modes. It is contemplated that the present invention may apply to HPDLs with singular domains, multiple domains, and/or multiple domain categories (or sub-domains).

In some embodiments of the present invention, method 250 may begin with an existing, predefined HPDL, while in other embodiments the HPDL may be initially extracted from the corpus using, for example, term extraction methods adapted to achieve high levels of precision. Some known methods for extracting an initial HPDL from the corpus are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description. In the present example embodiment, the HPDL has a domain of “things that jump” and initially includes the following terms: (i) fox; and (ii) rabbit (in other embodiments, an HPDL including the terms “fox” and “rabbit” might also have a domain of “animals” and a sub-domain of “mammals”).

The present embodiment also refers extensively to a corpus. The corpus is the text sample from which method 250 extracts relevant terms. In other words, the corpus is the text that is being acted upon (interpreted, processed, classified, etc.) during term extraction. In the present example embodiment, the corpus includes the following text: “A quick brown fox jumps over the lazy dog, but a quicker, more nimble kangaroo jumps over the fox. The following day, while the kangaroo leaps over the still-lazy dog, a determined frog leaps over a surprisingly speedy sloth.”

Processing begins at step S255, where extract candidate terms module (“mod”) 302 extracts candidate terms (also referred to as “relevant terms”) from the corpus. In many embodiments, various statistical methods are used to extract relevant candidate terms. A number of these known methods are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description. However, these are not meant to be all-inclusive or limiting, as other, less traditional extraction methods may also be used. In other embodiments of the present invention, dictionaries or domain lexicons different from and/or unrelated to the HPDL may be used in this step. For example, in the present example embodiment, candidate terms are extracted from the corpus if they are identified as “animals”. As such, the following terms are extracted from the corpus: (i) fox; (ii) dog; (iii) kangaroo; (iv) frog; and (v) sloth. Furthermore, terms that are already in the HPDL are excluded from the candidate terms list. Therefore, “fox” is not included in the candidate terms list, and the resulting list is as follows: (i) dog; (ii) kangaroo; (iii) frog; and (iv) sloth.

Processing proceeds to step S260, where discover new generation mod 304 discovers a new generation of HPDL terms from the candidate terms using the HPDL and its contextual characteristics. This step begins by identifying contextual characteristics (or “initial contextual characteristics”) of the terms in the HPDL. A contextual characteristic is a feature of a term derived from that term's particular usage in a given corpus (for a more complete definition of “contextual characteristic,” see the Definitions Sub-Section of this Detailed Description). In the present example embodiment, the contextual characteristic for each term in the HPDL is the word immediately following that term in the corpus (when a term is the last word in a sentence, it does not have a contextual characteristic). So, in the present embodiment, the only contextual characteristic for the term “fox” (the first HPDL term) is “jumps”, because the only word immediately following “fox” in the corpus is “jumps”. For the second HPDL term, “rabbit”, there are no contextual characteristics, because “rabbit” does not appear in the corpus. As such, the only contextual characteristic of the HPDL is the word “jumps”. It should be noted that although the present embodiment includes a simple example with one contextual characteristic, in many embodiments the HPDL has a plurality of contextual characteristics.

Once contextual characteristics for the HPDL have been identified, those characteristics are then applied to the candidate terms. In the present example, the only candidate term to immediately precede the word “jumps” is “kangaroo”. As such, “kangaroo” (the “first term”) is the only term included in the current generation of discovered terms. In other embodiments of the present invention, however, the current generation may include a plurality of discovered terms. In those embodiments, additional steps may be taken to further refine the list of discovered terms (for some examples, see the Further Comments and/or Embodiments Sub-Section of this Detailed Description).
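
For readers who prefer code, the following minimal sketch reproduces the two iterations of this example embodiment (steps S260 through S275) in Python. It is an illustration only, not the claimed method; the tokenization, sentence splitting, and function names are simplifying assumptions.

```python
import re

def discover_generation(corpus, hpdl, candidates):
    sentences = [re.findall(r"[a-z-]+", s)
                 for s in re.split(r"[.!?]", corpus.lower())]
    # Contextual characteristics of the HPDL: the word immediately
    # following each HPDL term within a sentence (step S260).
    following = {words[i + 1]
                 for words in sentences
                 for i, w in enumerate(words[:-1]) if w in hpdl}
    # A candidate is "discovered" if it immediately precedes one of
    # those context words anywhere in the corpus.
    return {words[i]
            for words in sentences
            for i in range(len(words) - 1)
            if words[i] in candidates and words[i + 1] in following}

corpus = ("A quick brown fox jumps over the lazy dog, but a quicker, "
          "more nimble kangaroo jumps over the fox. The following day, "
          "while the kangaroo leaps over the still-lazy dog, a "
          "determined frog leaps over a surprisingly speedy sloth.")
hpdl = {"fox", "rabbit"}
candidates = {"dog", "kangaroo", "frog", "sloth"}

for _ in range(2):                        # two iterations (step S275)
    new_terms = discover_generation(corpus, hpdl, candidates)
    hpdl |= new_terms                     # step S265
    candidates -= new_terms               # step S270

print(sorted(hpdl))        # ['fox', 'frog', 'kangaroo', 'rabbit']
print(sorted(candidates))  # ['dog', 'sloth']
```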

Processing proceeds to step S265, where update terms mod 306 adds the current generation of terms to the HPDL. In the present embodiment, the term “kangaroo” is added to the HPDL, with the resulting HPDL (or “revised set of category related terms”) being as follows: (i) fox; (ii) rabbit; and (iii) kangaroo.

Processing proceeds to step S270, where update terms mod 306 deletes the current generation of terms from the candidate terms list. In the present embodiment, the term “kangaroo” is removed from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; (ii) frog; and (iii) sloth.

Processing proceeds to step S275, where iterate mod 308 checks to see if method 250 is on its last iteration. In the present embodiment, a total of two iterations are to be performed. As such, method 250 is not on its last iteration (NO), and processing returns to step S260 for another iteration. In other embodiments, however, other tests may be used. For example, in one embodiment, iterations may occur until the HPDL reaches a certain size. In another embodiment, iterations may continue to occur for a certain period of time. In still other embodiments, iterations may continue to occur indefinitely and/or until no further terms for the HPDL are discovered.

In the present example, upon returning to step S260, discover new generation mod 304 repeats the process of identifying contextual characteristics of the terms in the HPDL. However, this time, there is an additional term (“kangaroo”) in the HPDL. As a result, an additional contextual characteristic (or “first term contextual characteristic”) is identified: the word “leaps”, which immediately follows the word “kangaroo” in the second sentence of the corpus. As such, when the updated contextual characteristics are applied to the candidate terms, an additional match is found: the term “frog” appears immediately before the word “leaps” in the corpus. As a result, “frog” (the “second term”) is added to the current generation of discovered terms.

Processing proceeds to step S265, where update terms mod 306 adds “frog” to the HPDL, resulting in the following HPDL: (i) fox; (ii) rabbit; (iii) kangaroo; and (iv) frog. Processing then proceeds to step S270, where update terms mod 306 removes “frog” from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; and (ii) sloth.

Processing proceeds to step S275. In the present example, two iterations have now completed, which means that method 250 is on its final iteration. Therefore, step S275 resolves to “YES”, and processing proceeds to step S280, where method 250 ends. As a result of executing method 250, the HPDL for the domain of “things that jump” now includes two additional terms (“kangaroo” and “frog”), and will be able to further extract contextually relevant terms in future iterations and/or from different corpuses. Additionally, system 102 now also has a list of candidate terms (“dog” and “sloth”) that may be helpful for other NLP-related tasks.

III. Further Comments and/or Embodiments

Some embodiments of the present invention recognize the following facts, potential problems and/or potential areas for improvement with respect to the current state of the art: (i) some existing approaches (including approaches that rely on linguistic processors to extract candidate terms) do not perform well when the corpus (or text) has a different genre (or domain) than the corpus used to build the processor; (ii) some existing approaches rely purely on statistical methods (such as n-gram sequences or topic modeling) to extract candidate terms, thereby negatively affecting system precision; (iii) existing approaches can be configured to provide terms with either high precision or high recall, but not both (thereby negatively affecting the overall accuracy of the system); and/or (iv) existing approaches are unable to discover new domain-specific terms directly from the corpus in a bootstrapping manner.

Some embodiments of the present invention may include one, or more, of the following features, characteristics and/or advantages: (i) using contextual similarity with a high precision domain lexicon for ranking; (ii) extracting candidate terms statistically; (iii) using an approach other than singular value decomposition; (iv) extracting terms without using linguistic processors; (v) extracting terms without analyzing linguistic or structural characteristics of a document; (vi) extracting terms without using syntactic and semantic contextual analysis; (vii) extracting terms without using dictionary-based statistics; (viii) extracting terms without using specialized corpora; (ix) extracting terms based on contextual information of a lexicon obtained from a given corpus; and/or (x) using association rules to measure unithood and/or filter candidate terms.

Many embodiments of the present invention are adapted to identify terms (nouns or noun phrases) from a corpus with both high precision and recall without using any linguistic processors and open domain linguistic resources (such as dictionaries and ontologies). In doing so, these embodiments may include one, or more, of the following features, characteristics and/or advantages: (i) providing an iterative approach to term discovery where discovery depends on weighted contextual characteristics of already discovered terms in previous iterations; (ii) ranking purely statistically extracted candidate terms (N-grams) based on their noun specificity and term specificity determined using weighted contextual similarity with known terms (nouns or noun phrases); (iii) using association rules, filtering candidate terms that cannot exist independently; and/or (iv) validating unithood of candidate terms using association rules.

Some embodiments of the present invention may further include one, or more, of the following helpful features, characteristics, and/or advantages: (i) achieving positive results in entity set expansion tasks, where the goal is to identify entities from the corpus in a bootstrapping manner; (ii) performing well on diverse domains such as medical and/or news; (iii) performing better term extraction for any language, including languages for which linguistic processors have not been built or do not perform well; and/or (iv) keeping resources such as dictionaries, lexicons, ontologies, and/or entity lists up-to-date.

Method 400 according to the present invention is provided in FIG. 4. Method 400 is adapted to extract terms and their variants from a corpus with both high precision and high recall. High precision and recall occur even if the corpus has a domain that is different from the system's source domain, or if the corpus is in a language for which linguistic systems are not available or mature. Processing begins with step S402, where the method 400 uses a statistical corpus analysis to extract candidate terms. This step S402 uses known (or to be known in the future) statistical approaches (such as frequent item set mining, language modeling, and topic modeling) to extract potential candidate terms from the corpus. Additionally, step S402 filters out irrelevant potential candidate terms using statistical criteria.

Processing proceeds to step S404, where the method 400 creates a high precision domain lexicon and analyzes contextual information therein. The high precision domain lexicon terms (or “lexicon terms”) are either manually extracted from the corpus or automatically extracted using any system (known or to be known in the future) configured to focus on high precision. Once the lexicon terms have been extracted, context words of those lexicon terms (such as the words appearing before and after the lexicon terms in the corpus) are extracted and weighted to create a set of weighted term context words.

Processing proceeds to step S406, where the method 400 ranks the candidate terms (see step S402) based on contextual similarity with the weighted term context words.

Processing proceeds to step S408, where method 400 selects the top candidate terms as discovered terms. The number of top candidate terms to be selected is a preconfigured value that depends on application and business context.

Processing proceeds to step S410, where the newly discovered terms (that is, the top candidate terms) are added to the high precision domain lexicon. Once the newly discovered terms have been added, they are deleted from the candidate terms list.

Processing proceeds to step S412, where the method 400 compares an iteration count with a pre-defined iteration threshold (where the iteration threshold is determined based on application and business context). Processing then proceeds to step S414. If the iteration count is less than the iteration threshold (NO), processing returns to step S404 to discover more relevant terms from the corpus. If the iteration count is greater than or equal to the iteration threshold, however, processing for method 400 completes. As a result of method 400 completing: (i) the high precision domain lexicon now includes additional relevant domain terms; and (ii) the list of candidate terms includes additional contextually relevant terms that may be used for natural language processing or other tasks.

In some embodiments of the present invention, step S402 (Extract Candidate Terms using Statistical Corpus Analysis, see FIG. 4) further includes method 500 (see FIG. 5). Method 500 is adapted to apply statistical approaches to the corpus to extract potential candidate terms. The potential candidate terms are passed through statistical filters to identify relevant terms, resulting in new candidate terms. Processing begins with step S502, where method 500 extracts text from the corpus and then applies heuristics-based sentence splitters on the text to extract sentences.

Processing proceeds to step S504, where various statistical methods are applied to the extracted sentences for candidate term extraction. The method to be used is typically determined based on a few factors: (i) the type of document the corpus is (for example, a web page, a textbook, or a manual); (ii) the length of the corpus (for example, the number of words in the corpus); and/or (iii) the general domain of the corpus (for example, healthcare, finance, or telecommunication). Although many methods may be used, two are discussed below: (i) a statistical language modeling method (beginning with step S506); and (ii) an association rule mining method (beginning with step S514).

If the statistical language modeling method is chosen, processing proceeds to step S506, where the method 500 extracts n-grams from each extracted sentence in the corpus for each preconfigured value of n. N may be determined in a number of ways, including, for example, by conducting experiments or by prioritizing certain features (such as speed vs. accuracy). An n-gram is a contiguous sequence of words from a given extracted sentence, where ‘n’ represents the number of words in the sequence. All unique n-grams of the corpus are considered as potential candidate terms.
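
As an illustration (not prescribed by the embodiment), the n-gram extraction of step S506 can be sketched as follows, where max_n stands in for the preconfigured value of n:

```python
def extract_ngrams(sentence_words, max_n):
    """Return every contiguous word sequence of length 1..max_n."""
    return {tuple(sentence_words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(sentence_words) - n + 1)}

# All unique n-grams of this sentence become potential candidate terms.
print(extract_ngrams("ManufacturerB PhoneA W 4G".split(), 3))
```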

Processing proceeds to steps S508 and S510, where method 500 scores the potential candidate terms based on their termhood and unithood, respectively. Termhood (as used in step S508) scores the validity of the potential candidate term as a representative for the corpus content as a whole using one or more statistical measures now known (or to be known in the future). In one embodiment, a measure of frequency in a corpus is used. In another embodiment, a measure of ‘weirdness’ (the term's frequency in the corpus compared to its frequency in a reference corpus) is used. In yet another embodiment, a measure of the pertinence or specificity of the term to a particular domain is used.
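
As one hedged illustration of a termhood measure, the ‘weirdness’ ratio mentioned above might be computed as in the sketch below; the add-one smoothing of the reference count is an assumption to avoid division by zero, not something the embodiment specifies:

```python
def weirdness(term, domain_counts, ref_counts):
    """Relative frequency in the domain corpus divided by relative
    frequency in a reference corpus (higher = more domain-specific)."""
    domain_rel = domain_counts.get(term, 0) / sum(domain_counts.values())
    # Add-one smoothing so terms unseen in the reference corpus
    # do not cause division by zero (an illustrative choice).
    ref_rel = ((ref_counts.get(term, 0) + 1)
               / (sum(ref_counts.values()) + len(ref_counts)))
    return domain_rel / ref_rel
```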

Unithood (as used in step S510) scores the collocation strength (the strength of association between the parts of a term) of potential candidate terms using one of the statistical measures now known (or to be known in the future). In one embodiment, a mutual information test is used. In another embodiment, a t-test is used.
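
A pointwise mutual information score, one common form of the mutual information test named above, might look like the following sketch (the corpus-wide counts are assumed inputs):

```python
import math

def pmi(bigram_count, word1_count, word2_count, total_words):
    """Collocation strength of a two-word candidate: how much more
    often the words co-occur than independence would predict."""
    p_xy = bigram_count / total_words
    p_x = word1_count / total_words
    p_y = word2_count / total_words
    return math.log2(p_xy / (p_x * p_y))
```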

Once steps S508 and S510 have completed, processing proceeds to step S512. In step S512, potential candidate terms with termhood and unithood scores above pre-defined thresholds are selected and identified as candidate terms, and processing for method 500 completes.

If the association rule mining method is chosen in step S504 (as opposed to the statistical language modeling method discussed above), processing proceeds to step S514. In this step, method 500 uses an algorithm to extract frequent n-grams. The extracted frequent n-grams are identified as potential candidate terms.

An example of a way to extract frequent n-grams (sets of words occurring in a specific order) is to extract n-grams meeting the following criteria: (i) the n-grams are frequent; and (ii) the order-preserving subsets of the n-grams are frequent. In other words, in this embodiment method 500 first extracts unigrams (i.e. single words) that are frequent (that is, they have a frequency above a pre-defined threshold). Then, method 500 extracts frequent bigrams (i.e. two-word phrases) that are made up of the previously identified unigrams. This continues for n steps (where n equals the number of words in the phrase being analyzed). Another way to express this example extraction method is to say that for n>1, the system extracts n-grams that are frequent and also include frequent (n−1)-grams. For n=1, the system extracts unigrams that are frequent.
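
The following sketch shows one way (an illustrative assumption, in the spirit of apriori-style frequent item set mining) to implement this bootstrapped extraction; an n-gram is kept only if it meets its frequency threshold and both of its contiguous (n−1)-grams were kept in the previous pass:

```python
from collections import Counter

def frequent_ngrams(sentences, max_n, thresholds):
    """sentences: lists of words; thresholds: dict n -> min frequency."""
    kept = {}
    for n in range(1, max_n + 1):
        counts = Counter(tuple(words[i:i + n])
                         for words in sentences
                         for i in range(len(words) - n + 1))
        kept[n] = {g: c for g, c in counts.items()
                   if c >= thresholds[n]
                   and (n == 1 or (g[:-1] in kept[n - 1]
                                   and g[1:] in kept[n - 1]))}
    return kept
```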

Processing proceeds to step S516, where the method 500 analyzes each potential candidate term and generates association rules along with corresponding confidence values. To generate association rules, step S516 performs three tasks: (i) for every potential candidate term t, all non-empty ordered subsets s are generated; (ii) for every subset s of t, a forward rule, “s->t−s”, along with its confidence (measured as frequency of t divided by the frequency of s), is generated; and (iii) for every subset s of t, an inverse rule, “s<-t−s”, along with its confidence, is generated. To provide an example, in one embodiment of the invention, term t is “Mobile Phone A”. Applying task (i), the subsets for “Mobile Phone A” are: (a) “Mobile”; (b) “Phone”; (c) “A”; (d) “Mobile Phone”; and (e) “Phone A”. Applying task (ii), a forward rule for “Mobile Phone A” is “Mobile Phone->A”. And applying task (iii), an inverse rule is “Mobile<-Phone A”.
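
A sketch of rule generation for the contiguous splits of a term follows. The forward confidence freq(t)/freq(s) is stated above; treating the inverse confidence as freq(t)/freq(t−s) is an assumption consistent with the examples discussed below, and the frequency table is an assumed input:

```python
def rules_for_term(term, freq):
    """term: tuple of words; freq: dict mapping word tuples to counts
    (assumed to cover the term and all its contiguous sub-sequences).
    Returns {(left, right): confidence} for forward and inverse rules."""
    forward, inverse = {}, {}
    for i in range(1, len(term)):
        left, right = term[:i], term[i:]
        # Forward rule "left -> right": confidence = f(t) / f(left).
        forward[(left, right)] = freq[term] / freq[left]
        # Inverse rule "left <- right": confidence = f(t) / f(right).
        inverse[(left, right)] = freq[term] / freq[right]
    return forward, inverse
```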

Processing proceeds to step S518, where n-grams are filtered using the inverse rules created in step S516. In this step, term variations are identified and removed based on their confidence scores. The method 500 identifies a term variation if an inverse rule from the term variation to the term has a confidence score above a predefined threshold (determined experimentally, for example). For example, “Mobile Phone A” has one inverse rule with a confidence score above the predefined threshold: “Mobile<-Phone A”. Because the confidence score is over the threshold, the method 500 identifies that “Phone A” is a variation of “Mobile Phone A” and removes “Phone A” from the list of potential candidate terms.

Processing proceeds to step S520, where n-grams are filtered using forward rules (which serve as a measure of unithood for potential candidate terms). The confidence of a forward rule provides the probability of the order of term constituents. If none of the forward rules for a term have a confidence score above a pre-defined threshold, that term is removed from the list of potential candidate terms. For example, the term “Manufacturer launches new” has two forward rules: “Manufacturer launches->new” and “Manufacturer->launches new”. Because neither of the forward rules has a confidence level above the threshold, the term is removed from the list of potential candidate terms.
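
The two filters can be sketched together as follows, reusing the rule dictionaries from the rules_for_term sketch above. The 0.8 threshold matches the example discussed later, and retaining unigrams (which have no forward rules) is an assumption:

```python
def filter_candidates(terms, forward, inverse, threshold=0.8):
    # S518: right-hand sides of high-confidence inverse rules are
    # term variations and are removed.
    variations = {right for (left, right), conf in inverse.items()
                  if conf > threshold}
    kept = []
    for term in terms:
        if term in variations:
            continue
        # S520: keep the term only if some forward rule over its
        # constituents clears the threshold (unigrams are kept).
        if len(term) == 1 or any(conf > threshold
                                 for (left, right), conf in forward.items()
                                 if left + right == term):
            kept.append(term)
    return kept
```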

Upon completing step S520, processing for method 500 completes, resulting in a new list of candidate terms from the remaining potential candidate terms.

In some embodiments of the present invention, step S404 (Analyzing Contextual Information of High Precision Domain Lexicon, see FIG. 4) further includes method 600 shown in FIG. 6. Processing begins with step S602, where a high precision domain lexicon is created (either manually or automatically using methods configured to focus on high precision) or provided from previous iterations of method 600. Term variations from the lexicon are then filtered and/or replaced using previously generated inverse rules, if available (for example, from step S518 (see FIG. 5)). A term from the lexicon is identified as a term variation if an inverse rule from the term to some other longer term has a confidence score above a pre-defined threshold. If the longer term is a part of the lexicon, then the term identified as a term variation is removed. For example, if “Mobile Phone A” and “Phone A” are in the lexicon and the inverse rule “Mobile<-Phone A” has a confidence level above the threshold, then “Phone A” is removed from the lexicon. If the longer term is not part of the lexicon, then the lexicon term is replaced with the longer term. For example, if “Phone” is in the lexicon and the inverse rule “Mobile<-Phone” has a confidence level above the threshold, then “Phone” is replaced with “Mobile Phone”.
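
A minimal sketch of this clean-up step follows; the dictionary shape mapping each lexicon term to its best longer term and confidence is an assumed representation, not one the embodiment mandates:

```python
def clean_lexicon(lexicon, inverse_rules, threshold=0.8):
    """inverse_rules: term -> (longer_term, confidence)."""
    cleaned = set(lexicon)
    for term, (longer_term, conf) in inverse_rules.items():
        if term in cleaned and conf > threshold:
            cleaned.discard(term)          # drop the term variation
            if longer_term not in lexicon:
                cleaned.add(longer_term)   # or replace with longer term
    return cleaned

lexicon = {"Mobile Phone A", "Phone A", "Phone"}
rules = {"Phone A": ("Mobile Phone A", 0.95),
         "Phone": ("Mobile Phone", 0.90)}
print(clean_lexicon(lexicon, rules))
# {'Mobile Phone A', 'Mobile Phone'}
```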

Processing proceeds to step S604, where method 600 scores lexicon terms and ranks them based on their scores. Scoring may be performed by a variety of methods now known (or to be known in the future), and may be based on properties such as term frequency observed in a given corpus. Processing proceeds to step S606, where the top X terms are selected, where X is pre-defined (and determined experimentally, for example).

Processing proceeds to step S608, where context words are extracted from the corpus. First, occurrences of lexicon terms within a given corpus are identified. Then, for each occurrence, context words are extracted per a pre-defined window size (for example, the two words before and the two words after a lexicon term). The words within the window are identified as context words and added to a list of context words. Processing proceeds to step S610, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words.

Processing proceeds to step S612, where each context word is weighted. In some embodiments, the weight of a context word equals the number of unique lexicon terms the context word appears with divided by the total number of lexicon terms. Processing for method 600 concludes with a list of weighted term context words.
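
A sketch of this weighting, under the assumed input of a mapping from each lexicon term to the context words observed around it:

```python
def weight_context_words(term_to_context_words):
    """Weight = number of distinct lexicon terms a context word
    appears with, divided by the total number of lexicon terms."""
    total = len(term_to_context_words)
    appears_with = {}
    for term, context_words in term_to_context_words.items():
        for word in set(context_words):
            appears_with.setdefault(word, set()).add(term)
    return {word: len(terms) / total
            for word, terms in appears_with.items()}
```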

In some embodiments of the present invention, step S406 (Contextual Similarity based ranking of Candidate Terms, see FIG. 4) further includes method 700 shown in FIG. 7. Processing begins with step S702, where context words for candidate terms are extracted. First, occurrences of each candidate term (see discussion of method 500, above) in the corpus are identified. Then, from each occurrence, context words are extracted per a pre-defined window size (for example, the two words before and the two words after each candidate term).

Processing proceeds to step S704, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words. The remaining context words (“candidate term context words”) are selected, stored (along with their frequency), and mapped to their corresponding candidate terms.

Processing proceeds to step S706, where the contextual similarity between candidate term context words (see step S704, above) and weighted term context words (see discussion of method 600, above) is measured by a contextual similarity score. The contextual similarity score may be obtained by a number of methods now known or to be known in the future. In one example embodiment, the contextual similarity score is represented by the equation “Σi Wi*Fi”, where: the sum runs over the distinct context words of a candidate term (indexed by ‘i’); ‘Wi’ equals the weight of the i-th context word in the weighted term context words (‘Wi’ equals zero when the context word is not in that set); and ‘Fi’ equals the frequency of the i-th context word with respect to the candidate term in a given corpus.
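
In code, the score of one candidate term might be computed as in this sketch, where absent context words contribute zero weight (the example words and weights are hypothetical):

```python
def contextual_similarity(candidate_context_freqs, weighted_context_words):
    """Sum of W_i * F_i over the candidate's distinct context words."""
    return sum(weighted_context_words.get(word, 0.0) * freq
               for word, freq in candidate_context_freqs.items())

# Example: a candidate seen twice next to "launches" and once
# next to "buy" (hypothetical words and weights).
print(contextual_similarity({"launches": 2, "buy": 1},
                            {"launches": 0.5, "new": 0.25}))  # 1.0
```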

Processing proceeds to step S708, where candidate terms are ranked based on the contextual similarity score obtained in step S706 (and discussed in the preceding paragraph). The result of this step is a list of ranked candidate terms.

In some embodiments of the present invention, step S408 (Discover New Terms, see FIG. 4) further includes method 800 shown in FIG. 8. Processing begins with step S802, where method 800 selects the top K candidate terms from the ranked list and creates a set of top K candidate terms, where the value of K is pre-configured. Processing then proceeds to step S804, where method 800 removes any candidate terms from the set if they are also part of the domain lexicon. The remaining terms from the set of top K candidate terms are identified as, simply, “terms” and processing for method 800 completes.
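
Steps S802 and S804 reduce to a few lines, sketched here with illustrative names:

```python
def discover_new_terms(ranked_candidates, lexicon, k):
    """Top K ranked candidates, minus any already in the lexicon."""
    return [term for term in ranked_candidates[:k]
            if term not in lexicon]
```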

For explanation purposes, an example embodiment demonstrating the present invention and portions of the above-discussed methods 400, 500, 600, 700, 800 (see FIGS. 4, 5, 6, 7, and 8) is provided. Referring first to method 500 (see FIG. 5), table 900 (see FIG. 9) shows the results of step S514 on an example corpus. In this example, ‘n’ equals ‘4’. Table 900 begins with row 902, which shows the result of a frequent unigram extraction on the example corpus (showing both the extracted unigrams and their corresponding frequencies).

Referring still to table 900, row 904 shows the result of a frequent bigram extraction on the example corpus (showing both the extracted bigrams and their corresponding frequencies). The frequency threshold for the bigram extraction is 30; as such, bigrams with a frequency of less than 30 (none, in this example) will not be included in the output for step S514.

Referring still to table 900 (see FIG. 9), row 906 shows the result of a frequent trigram extraction on the example corpus (showing both the extracted trigrams and their corresponding frequencies). The frequency threshold for the trigram extraction is 20; as such, trigrams with a frequency of less than 20 (none, in this example) will not be included in the output for step S514.

Still referring to table 900 (see FIG. 9), row 908 shows the result of a frequent 4-gram extraction on the example corpus (showing both the extracted 4-grams and their corresponding frequencies). The frequency threshold for the 4-gram extraction is 10; as such, 4-grams with a frequency of less than 10 (none, in this example) will not be included in the output for step S514. The resulting example output of row 908, combined with the output from rows 902, 904, and 906, makes up the entire list of frequent n-grams generated by step S514.

Table 1000 (see FIG. 10) shows the results of step S516 (see FIG. 5), where association rules are generated from the list of frequent n-grams generated by the previous step S514. Specifically, row 1002 (see FIG. 10) shows the generated forward rules, along with their corresponding confidence values (see discussion of step S516, above). Although a given term can have multiple forward rules, in the present example, for each term, only the forward rule with the maximum confidence value is shown. Row 1004 (see FIG. 10) similarly shows the generated inverse rules, along with their corresponding confidence values (again, see discussion of step S516, above). Although a given term can have multiple inverse rules, in the present example, for each term, only the inverse rule with the maximum confidence value is shown.

Table 1100 (see FIG. 11) shows the results of steps S518 and S520 (see FIG. 5), where the n-grams created in step S514 are filtered using the inverse rules and the forward rules generated in step S516. In step S518, the inverse rules are applied to the list of frequent n-grams. For each inverse rule over a pre-defined confidence value threshold (in the present example, 0.8), the term on the right-hand side of the rule is removed from the list of frequent n-grams. The general reasoning for this is that when an inverse rule has a high confidence value, it is unlikely that the term on the right-hand side would exist independently, separate from the term on the left-hand side. To provide an example, because the inverse rule “ManufacturerA<-PhoneD” has a confidence value of 1.00, “PhoneD,” which is on the right-hand side of the rule, is removed from the list of frequent n-grams (as “PhoneD” is unlikely to appear without “ManufacturerA” as a prefix). Row 1102 of table 1100 shows all of the n-grams that have been filtered using the inverse rules, and row 1104 shows the n-grams that remain after that filtering.

In step S520, the forward rules are applied to the list of frequent n-grams. For each remaining n-gram in the list of frequent n-grams, the n-gram is removed from the list if it doesn't have a corresponding forward rule above a pre-defined confidence value threshold (in the present example, 0.8). To provide an example, because forward rule “ManufacturerB PhoneA W->4G” has a confidence value of 1.00 (which is greater than 0.8), the term “ManufacturerB PhoneA W 4G” remains on the list. Conversely, because the forward rule “Connect->ManufacturerA” has a confidence value of 0.10, the term “Connect ManufacturerA” is removed from the list. Row 1106 of table 1100 shows all of the n-grams that have been filtered using the forward rules, and row 1108 shows the n-grams that remain after that filtering and are considered candidate terms.

Referring now to method 600 (see FIG. 6), table 1200 (see FIG. 12) shows the results of steps S602, S604, and S606. Row 1202 shows the lexicon terms that have been extracted at the beginning of step S602. These terms are considered to be high precision domain lexicon terms for the domain of smartphones (collectively, they are referred to as the “high precision domain lexicon,” the “lexicon,” and/or the “lexicon terms”).

Continuing with step S602, method 600 identifies the n-grams from the corpus that end with any of the terms from the domain lexicon. Method 600 generates inverse rules for these n-grams along with corresponding confidence values. If the confidence of an inverse rule exceeds a pre-determined confidence value threshold (in this case, 0.8), then the method checks if the full term of the inverse rule is included in the high precision domain lexicon. If so, the term on the right-hand side of the rule is removed from the lexicon. If not, then the right-hand side term is replaced in the lexicon by the full term of the inverse rule. To provide an example of this, row 1204 of table 1200 shows both of the generated inverse rules that meet the confidence value threshold in the present example embodiment, along with their corresponding confidence values. For the first rule, “ManufacturerC<-PhoneB 12,” because “ManufacturerC PhoneB 12” is already included in the lexicon, “PhoneB 12” (that is, the term on the right-hand side of the rule) is removed from the lexicon. Conversely, for the second rule, “ManufacturerB<-PhoneC,” because “ManufacturerB PhoneC” is not in the lexicon, “PhoneC” is replaced by “ManufacturerB PhoneC” in the lexicon. The resulting, modified lexicon terms are shown in row 1206 of table 1200.

Still referring to table 1200 (see FIG. 12), row 1208 shows the results of step S604 (see FIG. 6), where the lexicon is scored using a C-Value/NC-Value method. As shown in table 1200, the lexicon terms are ranked based on their respective scores. In the next method step S606, the top X terms are selected. In the present case, X equals 5, so all four of the lexicon terms are selected, as shown in row 1210 of FIG. 12.

Referring still to the present example embodiment, table 1300 (see FIG. 13) shows the results of steps S608, S610, and S612 (see FIG. 6). In step S608 (shown in row 1302), context words for lexicon terms are extracted from the corpus with a pre-defined window. In the present embodiment, the window extends to one word before the term and one word after the term. So, when a lexicon term is found in the corpus, the word immediately preceding that lexicon term and the word immediately following the lexicon term are added to a list of context words. The list of context words is shown in row 1302, where each context word is listed along with the lexicon term(s) used to identify the context word.

Proceeding to step S610, a list of various closed-class words (such as determiners, prepositions, pronouns, and conjunctions) is used to reduce the number of words included in the list of context words. Row 1304 of table 1300 (see FIG. 13) shows the results of step S610 in the present example embodiment, where words such as “to,” “from,” and “your” have been removed from the list.

Referring still to table 1300 (see FIG. 13), step S612 provides weights for the context words, thereby creating weighted term context words. As mentioned above in the discussion of step S612, the weight of a given word is equal to the number of lexicon terms the word appeared with in the corpus divided by the total number of lexicon terms. The resulting weighted context words for the present example embodiment are shown in row 1306 of table 1300.

Referring now to method 700 (see FIG. 7), table 1400 (see FIG. 14) shows the results of steps S702 and S704 for the present example embodiment. In step S702 (the results of which are shown in row 1402), context words for candidate terms (see discussion of method 500, above) are extracted from the corpus with a predefined window. In the present embodiment, the window extends to one word before the term and one word after the term (as in step S608). So, when a candidate term is found in the corpus, the word immediately preceding the candidate term and the word immediately following the candidate term are extracted and added to a list of context words. Row 1402 shows the extracted context words for the present embodiment, along with their corresponding candidate terms. The number of times a context word appears with each candidate term is denoted by parentheses.

Processing continues to step S704, where closed-class context words (such as determiners, prepositions, pronouns, and conjunctions) are removed from the list of context words in a manner similar to the removal of closed-class context words in step S610. The resulting list of context words is shown in row 1404 of table 1400 (see FIG. 14).

Table 1500 (see FIG. 15) shows the results of steps S706 and S708 for the present example embodiment. In step S706, for each candidate term, a contextual similarity analysis is performed between the candidate term's context words (produced in step S704, discussed above) and the weighted term context words (produced in step S612, discussed above). In the present embodiment, the resulting contextual similarity score is calculated by computing the sum, for each of a candidate term's context words, of the candidate term's frequency with that context word multiplied by the context word's weight. If the given context word is not listed in the list of weighted term context words, then the weight of the context word is zero. The calculations for computing the contextual similarity score for each candidate term in the present example embodiment are shown in row 1502 of table 1500 (see FIG. 15). In the next row 1504, the candidate terms are listed according to their resulting contextual similarity scores (as a result of the contextual similarity score ranking that occurs in step S708).

Referring now to method 800, table 1600 (see FIG. 16) shows the results of steps S802 and S804 (see FIG. 8) for the present example embodiment. In step S802 of this embodiment, the top K candidate terms are selected from the ranked list produced in step S708 (and shown in row 1504 of table 1500 (see FIG. 15)). In this example, K equals six. The resulting discovered terms (that is, the top six terms from the ranked list of candidate terms) are shown in row 1602 of table 1600 (see FIG. 16). In the following step S804, the discovered terms produced in step S802 are removed from the list of candidate terms. The terms remaining in the list of candidate terms after this step are shown in row 1604 of table 1600.
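A sketch of the selection and removal; K is a parameter, and the default of six simply matches the example in the text:

```python
# Illustrative sketch of steps S802 and S804: take the top K ranked
# candidates as discovered terms, then remove them from the candidate pool.
def select_top_k(ranked, candidates, k=6):
    discovered = [term for term, _ in ranked[:k]]
    return discovered, set(candidates) - set(discovered)
```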

After completion of the steps in method 800, processing returns to step S410 in method 400 (see FIG. 4), where the newly discovered terms from step S802 are added to the high precision domain lexicon. The new, modified high precision domain lexicon for the present example embodiment is shown in row 1606 of table 1600.
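Putting the sketches together, the overall iteration might look like the following. This composes the illustrative functions defined above; the stopping condition (quit when no new terms are discovered or no candidates remain) is an assumption for illustration, as this passage describes only a single pass.

```python
# Illustrative composition of the earlier sketches: each pass re-derives
# context-word weights from the (growing) lexicon, scores the remaining
# candidates, and promotes the top K into the high precision domain lexicon.
def grow_lexicon(tokens, lexicon, candidates, k=6):
    lexicon, candidates = set(lexicon), set(candidates)
    while candidates:
        context = remove_closed_class(extract_context_words(tokens, lexicon))
        weights = weight_context_words(context, lexicon)
        ranked = rank_candidates(candidate_context_counts(tokens, candidates),
                                 weights)
        discovered, candidates = select_top_k(ranked, candidates, k)
        if not discovered:   # assumed stopping condition
            break
        lexicon |= set(discovered)
    return lexicon
```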

IV. Definitions

Present invention: should not be taken as an absolute indication that the subject matter described by the term "present invention" is covered by either the claims as they are filed, or by the claims that may eventually issue after patent prosecution; while the term "present invention" is used to help the reader get a general feel for which disclosures herein are believed to potentially be new, this understanding, as indicated by use of the term "present invention," is tentative and provisional and subject to change over the course of patent prosecution as relevant information is developed and as the claims are potentially amended.

Embodiment: see definition of "present invention" above; similar cautions apply to the term "embodiment."

and/or: inclusive or; for example, A, B "and/or" C means that at least one of A or B or C is true and applicable.

Module/Sub-Module: any set of hardware, firmware and/or software that operatively works to do some kind of function, without regard to whether the module is: (i) in a single local proximity; (ii) distributed over a wide area; (iii) in a single proximity within a larger piece of software code; (iv) located within a single piece of software code; (v) located in a single storage device, memory or medium; (vi) mechanically connected; (vii) electrically connected; and/or (viii) connected in data communication.

Computer: any device with significant data processing and/or machine readable instruction reading capabilities including, but not limited to: desktop computers, mainframe computers, laptop computers, field-programmable gate array (FPGA) based devices, smart phones, personal digital assistants (PDAs), body-mounted or inserted computers, embedded device style computers, and application-specific integrated circuit (ASIC) based devices.

Contextual characteristic: a feature of a term derived from that term's particular usage in a corpus; some examples of possible contextual characteristics include: (i) proximity-related characteristics, such as the words located within n words of the term, the words located farther than n words away from the term, and/or the distance between the term and a specific, pre-identified word; (ii) frequency-related characteristics, such as the number of times the term appears in the corpus, the most/least number of times the term appears in a sentence, and/or the relative percentage of the term compared to the other terms in the corpus; and/or (iii) usage-related characteristics, such as the location of the term in a sentence, the location of the term in a paragraph, whether the term commonly appears in the singular form or in the plural form, whether the term regularly appears as a noun/verb/adjective/adverb/subject/object, the adjectives used to describe the term (when a noun), the adverbs used to describe the term (when a verb), the nouns the term typically describes (when an adjective), the verbs the term typically describes (when an adverb), the object of the term (when a subject), and/or the subject of the term (when an object).

1-7. (canceled)
8. A computer program product comprising a computer readable storage medium having stored thereon: first program instructions programmed to identify a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; second program instructions programmed to add the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and third program instructions programmed to identify a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
9. The computer program product of claim 8, further comprising: fourth program instructions programmed to add the second term to the revised set of category related term(s), thereby creating a second revised set of category related term(s) and a set of second term contextual characteristic(s), where each second term contextual characteristic of the set of second term contextual characteristic(s) relates to the contextual use of the second term in the corpus; and fifth program instructions programmed to identify a third term from the corpus, based, at least in part, on the set of second term contextual characteristic(s).
10. The computer program product of claim 8, wherein: the identifying of the second term from the corpus is further based, at least in part, on the set of initial contextual characteristic(s).
11. The computer program product of claim 8, further comprising: fourth program instructions programmed to create the set of category related term(s), where at least one category related term of the set of category related term(s) is extracted from the corpus using a precision oriented extraction method.
12. The computer program product of claim 8, wherein: the first term belongs to a set of relevant term(s), where each relevant term of the set of relevant term(s) is extracted from the corpus using a statistical extraction method.
13. The computer program product of claim 8, wherein: each initial contextual characteristic of the set of initial contextual characteristic(s) includes a contextual weight corresponding to the respective initial contextual characteristic's use in the corpus.
14. The computer program product of claim 8, wherein: the identifying of the first term in the corpus is further based, at least in part, on a weighted strength of a match between the first term and the respective contextual weights of each initial contextual characteristic in the set of initial contextual characteristic(s).
15. A computer system comprising: a processor(s) set; and a computer readable storage medium; wherein: the processor(s) set is structured, located, connected and/or programmed to run program instructions stored on the computer readable storage medium; and the program instructions include: first program instructions programmed to identify a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; second program instructions programmed to add the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and third program instructions programmed to identify a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
16. The computer system of claim 15, further comprising: fourth program instructions programmed to add the second term to the revised set of category related term(s), thereby creating a second revised set of category related term(s) and a set of second term contextual characteristic(s), where each second term contextual characteristic of the set of second term contextual characteristic(s) relates to the contextual use of the second term in the corpus; and fifth program instructions programmed to identify a third term from the corpus, based, at least in part, on the set of second term contextual characteristic(s).
17. The computer system of claim 15, wherein: the identifying of the second term from the corpus is further based, at least in part, on the set of initial contextual characteristic(s).
18. The computer system of claim 15, further comprising: fourth program instructions programmed to create the set of category related term(s), where at least one category related term of the set of category related term(s) is extracted from the corpus using a precision oriented extraction method.
19. The computer system of claim 15, wherein: the first term belongs to a set of relevant term(s), where each relevant term of the set of relevant term(s) is extracted from the corpus using a statistical extraction method.
20. The computer system of claim 15, wherein: each initial contextual characteristic of the set of initial contextual characteristic(s) includes a contextual weight corresponding to the respective initial contextual characteristic's use in the corpus.